Text Retrieval and Mining
# 11-1
Information Extraction
Lecture by Young Hwan CHO, Ph. D.
Youngcho@gmail.com
Page 2
Plan for Today
- Information Extraction
- Introduction to the IE problem
- Wrappers
- Wrapper Induction
- Traditional NLP-based IE
- Pattern Learning Systems: Rapier
- Probabilistic sequence models: HMMs
Page 3
What is Information Extraction?
- The extraction or pulling out of pertinent information from large volumes of text
- Rather than telling the user which documents to read, IE extracts the pieces of information the user needs and keeps links between the extracted information and the original documents, so the user can refer back to the source.
- Such information must be reliable and detailed; current technology reaches roughly the levels shown below.
Items of Information | Percentile Reliability | Definition
Entities | 90 | an object of interest such as a person or organization
Attributes | 80 | a property of an entity such as its name, alias, descriptor, or type
Facts | 70 | a relationship held between two or more entities
Events | 60 | an activity or occurrence of interest such as a terrorist act or an airline crash
Page 4
IE from the Web: The Big Picture
Page 5
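Components of Information Extraction
- Spider: collects web pages
- Collects the target web pages and finds the URL of the next page
- Wrapper: HTML page -> XML DB
- For CGI-style pages, a wrapper alone can be sufficient
- NLP Lib: extracts information from sentences
- Collects specific facts from free-style HTML, descriptive text, news, etc.
- The DB holds older data than the text does
- Information Cooking
- Identification: determine the document style
- Segmentation: split the document into its component pieces
- Classification: categorize entities within a document; categorize documents
- Clustering: cluster entities within a document; cluster documents
- Association: map information in the document to DB fields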
Page 6
Examples : Corpus
- Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field.
- Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.
Page 7
Examples : Entity
Persons: Fletcher Maddox, Dr. Maddox, Oliver, Oliver, Ambrose, Maddox
Organizations: UCSD Business School, La Jolla Genomatics, La Jolla Genomatics, L.J.G.
Locations: La Jolla, CA
Artifacts: Geninfo, Geninfo
Dates: June 1999
Page 8
Examples : Attributes
NAME: Fletcher Maddox, Maddox
DESCRIPTOR: former Dean of the UCSD Business School, his father, the firm's CEO
CATEGORY: PERSON

NAME: Oliver
DESCRIPTOR: His son, Chief Scientist
CATEGORY: PERSON

NAME: Ambrose
DESCRIPTOR: Oliver's brother, the CFO of L.J.G.
CATEGORY: PERSON

NAME: UCSD Business School
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: La Jolla Genomatics, L.J.G.
DESCRIPTOR:
CATEGORY: ORGANIZATION

NAME: Geninfo
DESCRIPTOR: its product
CATEGORY: ARTIFACT

NAME: La Jolla
DESCRIPTOR: the Maddox family's hometown
CATEGORY: LOCATION

NAME: CA
DESCRIPTOR:
CATEGORY: LOCATION
Page 9
Examples : Facts
PERSON | Employee_of | ORGANIZATION
Fletcher Maddox | Employee_of | UCSD Business School
Fletcher Maddox | Employee_of | La Jolla Genomatics
Oliver | Employee_of | La Jolla Genomatics
Ambrose | Employee_of | La Jolla Genomatics

ARTIFACT | Product_of | ORGANIZATION
Geninfo | Product_of | La Jolla Genomatics

LOCATION | Location_of | ORGANIZATION
La Jolla | Location_of | La Jolla Genomatics
CA | Location_of | La Jolla Genomatics
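The facts on this slide can be held as (subject, relation, object) triples. The sketch below is only an illustration of that data structure; the helper name `subjects` is hypothetical and not part of the lecture.

```python
# Facts from the example corpus, stored as (subject, relation, object) triples.
facts = [
    ("Fletcher Maddox", "Employee_of", "UCSD Business School"),
    ("Fletcher Maddox", "Employee_of", "La Jolla Genomatics"),
    ("Oliver", "Employee_of", "La Jolla Genomatics"),
    ("Ambrose", "Employee_of", "La Jolla Genomatics"),
    ("Geninfo", "Product_of", "La Jolla Genomatics"),
    ("La Jolla", "Location_of", "La Jolla Genomatics"),
    ("CA", "Location_of", "La Jolla Genomatics"),
]


def subjects(relation, obj):
    """Return every subject that stands in `relation` to `obj` (hypothetical helper)."""
    return [s for s, r, o in facts if r == relation and o == obj]


employees = subjects("Employee_of", "La Jolla Genomatics")
print(employees)
```

Storing facts this way makes a question such as "who works for La Jolla Genomatics?" a simple filter over the triples.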
Page 10
Examples : Events
COMPANY: La Jolla Genomatics
PRINCIPALS: Fletcher Maddox, Oliver, Ambrose
DATE:
CAPITAL:

COMPANY: La Jolla Genomatics
PRODUCT: Geninfo
DATE: June 1999
COST:
Page 11
Unstructured Data -> Structured/Semi-Structured Data
- Task = Filling slots in a database from sub-segments of text
- Techniques = Segmentation + classification + clustering + association
Page 12
Source Styles
Page 13
Segmentation
- Extract metadata (e.g. author, title, date)
- Identify sections (e.g. abstract)
- Extract keywords
Page 14
Clustering + Classification
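- Within a document
- Determine the entity type of each named entity in the document
- Person name, job title, organization name, date, institution, unit, address
- Title, list-style sentence, descriptive sentence
- When data of the same form is listed, items that follow the same pattern as an already identified item are assigned to the same field
- Across multiple documents
- Measure the reliability of the extracted information (document importance, relevance to the domain)
- Cross-compare information collected from multiple sources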
Page 15
Association
Page 16
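Global vs Local Extractions
- Local Extraction models
- Extract information from a single web site
- Convert the HTML documents into a formalized XML style tailored to that site
- Global Extraction models
- Extract fielded information from the text of many web sites
- Combining the two models
- The local model can supply training data for the global model, or a high-accuracy initial DB
- The global model can add new data or new fields that the local model did not produce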
Page 17
Information Extraction in Practice
- HTML pages generated by CGI
- Generation: DB -> (CGI) -> HTML
- Reverse engineering: HTML -> (Crawler) -> (Wrapper) -> DB
- News, Report
- Entities, attributes, facts, and events must be extracted through linguistic analysis
Page 18
Extracting Corporate Information
Data automatically extracted from marketsoft.com
Source web page. Color highlights indicate type of information. (e.g., red = name)
E.g., information need: Who is the CEO of MarketSoft?
Source: Whizbang! Labs / Andrew McCallum
Page 19
Product information
Page 20
Product information
Page 21
Canonicalization: Product information
Page 22
Wrappers
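- To extract information with an agent, rules are needed that describe the location, structure, and format of the information to be extracted from each document; such rules are generally called wrappers.
- Writing a wrapper
- Manual authoring: can raise extraction accuracy, but offers no remedy when the document changes
- Automatic generation: generated automatically from domain knowledge and sample documents; can adapt to document changes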
Page 23
Amazon Book Description
….
</td></tr>
</table>
<b class="sans">The Age of Spiritual
Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica
size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=
Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br>
</font>
<br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif"
width=90
height=140
align=left border=0></a>
<font face=verdana,arial,helvetica
size=-1>
<span class="small">
<span class="small">
<b>List Price:</b>
<span class=listprice>$14.95</span><br>
<b>Our Price: <font
color=#990000>$11.96</font></b><br>
<b>You Save:</b>
<font color=#990000><b>$2.99 </b>
(20%)</font><br>
</span>
<p> <br>…
Page 24
Extracted Book Template
Title: The Age of Spiritual Machines : When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:
Page 25
Wrappers: Simple Extraction Patterns
- Specify an item to extract for a slot using a regular expression pattern.
- Price pattern: “\$\d+(\.\d{2})?\b”
- May require preceding (pre-filler) pattern to identify proper context.
- Amazon list price:
- Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
- Filler pattern: “\$\d+(\.\d{2})?\b”
- May require succeeding (post-filler) pattern to identify the end of the filler.
- Amazon list price:
- Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
- Filler pattern: “.+”
- Post-filler pattern: “</span>”
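A minimal sketch of the pre-filler / filler / post-filler idea, using Python's `re` module. The HTML string is a one-line stand-in for the Amazon page fragment, and the non-greedy filler `.+?` is an assumption made here so that the match stops at the first closing tag.

```python
import re

# One-line stand-in for the Amazon fragment shown earlier (illustrative only).
html = "<b>List Price:</b> <span class=listprice>$14.95</span><br>"

# Filler only: a dollar amount with optional cents. There is no leading \b
# because "$" is not a word character.
price = re.compile(r"\$\d+(\.\d{2})?\b")

# Pre-filler + filler + post-filler. The delimiters are matched literally.
pre = re.escape("<b>List Price:</b> <span class=listprice>")
post = re.escape("</span>")
list_price = re.compile(pre + r"(.+?)" + post)

first_price = price.search(html).group(0)
slot_value = list_price.search(html).group(1)
print(first_price, slot_value)
```

The filler-only pattern finds any price on the page, while the delimited pattern pins the match to the list-price slot.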
Page 26
Wrapper induction
Highly regular source documents -> Relatively simple extraction patterns -> Efficient learning algorithm
- Writing accurate patterns for each slot for each domain (e.g. each web site) requires laborious software engineering.
- Alternative is to use machine learning:
- Build training data (pairs of documents and human-written rules).
- Automatically learn the specific patterns that appear around each item in the HTML documents.
Page 27
Wrapper induction: Delimiter-based extraction
Use <B>, </B>, <I>, </I> for extraction
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
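Delimiter-based extraction on the country-code page can be sketched as follows. The function name `run_lr_wrapper` and the representation of a wrapper as a list of (left, right) string pairs are assumptions made for illustration.

```python
def run_lr_wrapper(page, delimiters):
    """Extract tuples using a wrapper given as [(l1, r1), ..., (lK, rK)]."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows  # no further tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows  # unterminated value; drop the partial tuple
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


page = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>"
)
codes = run_lr_wrapper(page, [("<B>", "</B>"), ("<I>", "</I>")])
print(codes)
```

Each pass scans forward for the next left delimiter, reads up to the matching right delimiter, and stops when no further left delimiter is found.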
Page 28
Learning LR wrappers
labeled pages -> wrapper l1, r1, …, lK, rK
Example: Find 4 strings (l1, r1, l2, r2)
<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
Page 29
LR: Finding r1
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
r1 can be any prefix common to the text that follows each country name, e.g. </B>
Page 30
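LR: Finding l1, l2 and r2
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
r2 can be any prefix common to the text that follows each code, e.g. </I>
l2 can be any suffix common to the text that precedes each code, e.g. <I>
l1 can be any suffix common to the text that precedes each country name, e.g. <B>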
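The search for candidate delimiters can be sketched with common prefixes and suffixes. This is a simplified illustration, not the full learning algorithm: it returns the longest common string for each delimiter and omits the validity checks a real learner needs (for example, that a delimiter never occurs inside an extracted value). The function name `learn_lr` and the way labels are supplied are assumptions.

```python
import os


def common_prefix(strings):
    # os.path.commonprefix compares character by character, so it works on any strings.
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]


def learn_lr(page, labeled_rows):
    """Return one (left, right) candidate pair per attribute."""
    # Locate each labeled value in reading order to get (start, end) spans.
    spans, pos = [], 0
    for row in labeled_rows:
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            spans.append((start, pos))

    k = len(labeled_rows[0])
    delimiters = []
    for attr in range(k):
        before, after = [], []
        for i, (start, end) in enumerate(spans):
            if i % k != attr:
                continue
            prev_end = spans[i - 1][1] if i > 0 else 0
            next_start = spans[i + 1][0] if i + 1 < len(spans) else len(page)
            before.append(page[prev_end:start])  # text preceding this value
            after.append(page[end:next_start])   # text following this value
        delimiters.append((common_suffix(before), common_prefix(after)))
    return delimiters


page = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>"
)
rows = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]
wrapper = learn_lr(page, rows)
print(wrapper)
```

Any suffix of a returned left string and any prefix of a returned right string is a further candidate, which is why short delimiters such as `<B>` and `</B>` also qualify.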
Page 31
Wrapper Generator: Overall Flow and Domain Knowledge Representation
Page 32
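Wrapper Generator: Preprocessing (Logical Line Generation)
- Remove HTML tags that are invisible when rendered, so that the text resembles what the browser displays, and split lines at table-related tags (e.g. TR, TH) and at list-type tags used to break lines (e.g. BR, P, LI).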
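Logical line generation can be sketched with two regular-expression passes. The tag list follows the examples named on the slide (TR, TH, BR, P, LI); the function name `logical_lines` and the whitespace handling are assumptions, and a production system would more likely use an HTML parser.

```python
import re

# Tags that start a new logical line (table-related and list-type tags).
BREAK_TAGS = r"</?(?:tr|th|br|p|li)\b[^>]*>"


def logical_lines(html):
    flat = " ".join(html.split())  # source line breaks carry no meaning in HTML
    marked = re.sub(BREAK_TAGS, "\n", flat, flags=re.IGNORECASE)
    visible = re.sub(r"<[^>]+>", "", marked)  # drop every remaining tag
    lines = (" ".join(part.split()) for part in visible.split("\n"))
    return [line for line in lines if line]


html = (
    "<b>List Price:</b>\n"
    "<span class=listprice>$14.95</span><br>\n"
    "<b>Our Price: <font color=#990000>$11.96</font></b><br>"
)
lines = logical_lines(html)
print(lines)
```

Each resulting line corresponds to one row of visible text, which is the unit the later semantic analysis works on.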
Page 33
Wrapper Generator: Semantic Analysis of Logical Lines Using Domain Knowledge
For each OBJECT in the domain knowledge, find its pattern in the logical lines and record the matching FORMAT.
Page 34
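XML Rule Generation
- Apply the domain knowledge to describe, as an XML document, how specific patterns appear in the HTML, and save it as an XML file.
- Extract information from the corresponding HTML documents according to the extraction rules described in this XML.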
Page 35
Natural Language Processing-based IE
- If extracting from more natural, unstructured, human-written text, some NLP may help.
- Part-of-speech (POS) tagging
- Mark each word as a noun, verb, preposition, etc.
- Syntactic parsing (noun phrases, verb phrases, adnominal phrases)
- Identify phrases: NP, VP, PP
- Semantic word categories (e.g. from WordNet)
- KILL: kill, murder, assassinate, strangle, suffocate
- Extraction patterns can use POS or phrase tags.
- Crime victim: someone [killed whom]
- Prefiller: [POS: V, Hypernym: KILL]
- Filler: [Phrase: NP]
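The crime-victim pattern can be sketched over text that has already been chunked and tagged. Everything here is illustrative: the `Chunk` record, the hand-tagged sentence, and the `KILL` set standing in for a WordNet lookup are assumptions, not output of a real tagger or parser.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    phrase: str      # phrase tag: NP, VP, PP
    text: str        # surface text of the chunk
    head_lemma: str  # lemma of the head word
    head_pos: str    # POS of the head word: N, V, P


# Hypothetical semantic class; in practice this could come from WordNet.
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}


def find_victim(chunks):
    """Prefiller: [POS: V, Hypernym: KILL]; Filler: [Phrase: NP]."""
    for prev, cur in zip(chunks, chunks[1:]):
        if prev.head_pos == "V" and prev.head_lemma in KILL and cur.phrase == "NP":
            return cur.text
    return None


sentence = [
    Chunk("NP", "The gunman", "gunman", "N"),
    Chunk("VP", "assassinated", "assassinate", "V"),
    Chunk("NP", "the mayor", "mayor", "N"),
]
victim = find_victim(sentence)
print(victim)
```

Because the pattern tests the semantic class of the verb rather than its surface form, one rule covers all the verbs listed under KILL.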
Page 36
Finite state automata transductions
[Figure: finite-state automaton with states 0 to 4 and transitions labeled PN, ’s, ADJ, Art, N, P]
John’s interesting book with a nice cover
Pattern-matching
PN ’s (ADJ)* N P Art (ADJ)* N
{PN ’s | Art}(ADJ)* N (P Art (ADJ)* N)*
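The generalized pattern can be tried out as a regular expression over a sequence of tags. This is a sketch under assumptions: the tag `POS` is used here to stand for the possessive ’s, and the tag sequence for the example phrase is written by hand rather than produced by a tagger.

```python
import re

# {PN 's | Art} (ADJ)* N (P Art (ADJ)* N)*, with every tag followed by one space.
NP_PATTERN = re.compile(r"(?:PN POS |Art )(?:ADJ )*N (?:P Art (?:ADJ )*N )*")


def matches_np(tags):
    """True if the whole tag sequence is accepted by the pattern."""
    return NP_PATTERN.fullmatch(" ".join(tags) + " ") is not None


# Hand-assigned tags for: John 's interesting book with a nice cover
tags = ["PN", "POS", "ADJ", "N", "P", "Art", "ADJ", "N"]
print(matches_np(tags))
```

A regular expression over tags is equivalent to a finite-state automaton, so this accepts the same sequences as the automaton in the figure would under these tag names.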
Page 37
Rule-based Extraction Examples
- Determining which person holds what office in what organization
- [person] , [office] of [org]
- Vuk Draskovic, leader of the Serbian Renewal Movement
- [org] (named, appointed, etc.) [person] P [office]
- NATO appointed Wesley Clark as Commander in Chief
- Determining where an organization is located
- [org] in [loc]
- NATO headquarters in Brussels
- [org] [loc] (division, branch, headquarters, etc.)
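The first rule, [person] , [office] of [org], can be approximated with a regular expression. This is a rough sketch: treating a run of capitalized words as a person or organization and a run of lowercase words as an office is an assumption made here, where a real system would take those spans from a named-entity tagger.

```python
import re

# Assumption: a name is one or more capitalized words separated by spaces.
NAME = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# Rule: [person] , [office] of [org]
RULE = re.compile(
    rf"(?P<person>{NAME}), (?P<office>[a-z]+(?: [a-z]+)*?) of (?:the )?(?P<org>{NAME})"
)

match = RULE.search("Vuk Draskovic, leader of the Serbian Renewal Movement")
fact = (match.group("person"), match.group("office"), match.group("org"))
print(fact)
```

The named groups map directly onto the slots of the rule, so each match yields one (person, office, organization) fact.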
Page 38
Three generations of IE systems
- Hand-Built Systems – Knowledge Engineering [1980s– ]
- Rules are written by hand
- Requires experts who know both the domain and the IE system well
- Iterate { guess, test, revise }
- Automatic, Trainable Rule-Extraction Systems [1990s– ]
- Systems that discover rules automatically using predefined templates
- Require large labeled corpora
- Statistical Generative Models [1997 – ]
- Use statistical models that find the relevant parts of a document, e.g. HMMs or statistical parsers
- Learning usually supervised; may be partially unsupervised
Page 39
Evaluating IE Accuracy
- Performance is measured on human-built test data that was not used during system development
- Template Measure for each test document:
- Total number of correct extractions in the solution template: N
- Total number of slot/value pairs extracted by the system: E
- Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
- Compute average value of metrics adapted from IR:
- Recall = C/N
- Precision = C/E
- F-Measure = Harmonic mean of recall and precision
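The template measures for one document can be computed directly from sets of slot/value pairs. The function name `template_scores` and the sample extraction (with one wrong price) are made up for illustration.

```python
def template_scores(solution, extracted):
    """Return (recall, precision, F) for sets of (slot, value) pairs."""
    n = len(solution)               # N: correct extractions in the solution template
    e = len(extracted)              # E: pairs extracted by the system
    c = len(solution & extracted)   # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    # Harmonic mean of recall and precision; defined as 0 when nothing is correct.
    f = 2 * precision * recall / (precision + recall) if c else 0.0
    return recall, precision, f


solution = {
    ("Title", "The Age of Spiritual Machines"),
    ("Author", "Ray Kurzweil"),
    ("List-Price", "$14.95"),
    ("Price", "$11.96"),
}
extracted = {
    ("Author", "Ray Kurzweil"),
    ("List-Price", "$14.95"),
    ("Price", "$2.99"),  # wrong value, so it counts toward E but not C
}
recall, precision, f_measure = template_scores(solution, extracted)
print(recall, precision, f_measure)
```

Here N = 4, E = 3, and C = 2, so recall is 2/4 and precision is 2/3; averaging these per-document values over the test set gives the reported figures.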
Page 40
MUC: the genesis of IE
- DARPA funded significant efforts in IE in the early to mid 1990s.
- Message Understanding Conference (MUC) was an annual event/competition where results were presented.
- Focused on extracting information from news articles:
- Terrorist events
- Industrial joint ventures
- Company management changes
- Information extraction of particular interest to the intelligence community (CIA, NSA).
- Reference site
- http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
Page 41
MUC Information Extraction: State of the Art c. 1997
NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production
Page 42
Basic IE References
Dun & Bradstreet is the oldest, largest, most established seller of business info in the world. They maintain a DB of all 11M US companies, and they do it very inefficiently: phone calls.
We are extracting basic company identification information, like name, address, phone, fax, email from over 10M domain names.
Again, on left, original page, with markup showing where WB extracted the DB fields, which are shown on right.
Again, formatting and position on page is very indicative here. Relative position of entities says something about how they go together: which person with which title, etc.